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The Ising model is a useful tool for studying complex interactions 
within a system. The estimation of such a model, however, is rather 
challenging, especially in the presence of high-dimensional parame- 
ters. In this work, we propose efficient procedures for learning a sparse 
Ising model based on a penalized composite conditional likelihood 
with nonconcave penalties. Nonconcave penalized likelihood estima- 
tion has received a lot of attention in recent years. However, such 
an approach is computationally prohibitive under high-dimensional 
Ising models. To overcome such difficulties, we extend the methodol- 
ogy and theory of nonconcave penalized likelihood to penalized com- 
posite conditional likelihood estimation. The proposed method can be 
efficiently implemented by taking advantage of coordinate-ascent and 
minorization-maximization principles. Asymptotic oracle properties 
of the proposed method are established with NP-dimensionality. Op- 
timality of the computed local solution is discussed. We demonstrate 
its finite sample performance via simulation studies and further illus- 
trate our proposal by studying the Human Immunodeficiency Virus 
type 1 protease structure based on data from the Stanford HIV drug 
resistance database. Our statistical learning results match the known 
biological findings very well, although no prior biological information 
is used in the data analysis procedure. 

1. Introduction. The Ising model was first introduced in statistical phys- 
ics [Ising (1925)] as a mathematical model for describing magnetic interac- 
tions and the structures of ferromagnetic substances. Although rooted in 
physics, the Ising model has been successfully exploited to simplify complex 
interactions for network exploration in various research fields such as social- 
economics [Stauffer (2008)], protein modeling [Irback, Peterson and Potthast 
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(1996)] and statistical genetics [Majewski, Li and Ott (2001)]. Following the 
terminology in physics, consider an Ising model with K magnetic dipoles 
denoted by Xj, l<j< K. Each Xj equals +1 or —1, corresponding to the 
up or down spin state of the jth magnetic dipole. The energy function is 
defined as E = — ^Zi^j Pij \ 3 , where the coupling coefficient describes 
the physical interactions between dipoles i and j under the external mag- 
netic field, /3jj = and /3™ = (3ji for any According to Boltzmann's law, 
the joint distribution of X = (X\, . . . , Xk) should be 

(1.1) Pr^ = Xl ,...,X K = x K ) = -Lrexpfc ^p) , 

where Z((3) is the partition function. 

In this paper we focus on learning sparse Ising models; that is, many 
coupling coefficients are zero. Our research is motivated by the HIV drug 
resistance study where understanding the inter-residue couplings (interac- 
tions) could potentially shed light on the mechanisms of drug resistance. 
A suitable statistical learning method is to fit a sparse Ising model to the 
data, in order to discover the inter-residue couplings. More details are given 
in Section 5. In the recent statistical literature, penalized likelihood estima- 
tion has become a standard tool for sparse estimation. See a recent review 
paper by Fan and Lv (2010). In principle we can follow the penalized like- 
lihood estimation paradigm to derive a sparse penalized estimator of the 
Ising model. Unfortunately, the penalized likelihood estimation method is 
very difficult to compute under the Ising model because the partition func- 
tion Z{j3) is computationally intractable when the number of dipoles is 
relatively large. On the other hand, the composite likelihood idea [Lindsay 
(1988), Varin, Reid and Firth (2011)] offers a nice alternative. To elabo- 
rate, suppose we have A independent identically distributed (i.i.d.) realiza- 
tions of X from the Ising model, denoted by {(^i n , . . . , XKn), n = 1, . - - , A}. 
Let 6j = P(Xi = Xj|X(_j)), describing the conditional distribution of the 
jth dipole given the remaining dipoles, where X(_f) denotes X with the 
jth element removed. By (1.1), it is easy see that for the nth observa- 
tion, 

ex P(Sfc : k^j PjkXjnXkn) 
ex P(X/fc : kj^j fijk x jn x kn) + 1 

Note that 6j n does not involve the partition function. The conditional log- 
likelihood of the jth dipole, given the remaining dipoles, is given by 

1 N 

n=l 
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As in Lindsay (1988) a composite log-likelihood function can be denned as 

A 



This kind of composite conditional likelihood was also called pseudo-likeli- 
hood in Besag (1974). Another popular type of composite likelihood is com- 
posite marginal likelihood [Varin (2008)]. Maximum composite likelihood is 
especially useful when the full likelihood is intractable. Such an approach has 
important applications in many areas including spatial statistics, clustered 
and longitudinal data and time series models. A nice review on the recent 
developments in composite likelihood can be found in Varin, Reid and Firth 
(2011). 

To estimate a high-dimensional sparse Ising model, we consider the fol- 
lowing penalized composite likelihood estimator: 

( K K \ 



(1.2) 3 = argmax<k(/3)-^ £ P A (|/3 jfc |) 



=i k=j+i 



where P\(t) is a positive penalty function defined on [0, oo). In this work 
we focus primarily on the LASSO penalty [Tibshirani (1996)] and smoothly 
clipped absolute deviation (SCAD) penalty [Fan and Li (2001)]. The LASSO 
penalty is P\(t) = Xt. The SCAD penalty is defined by 

P' x (t) = \ll(t < A) + ^~f)+ I(t > A)}, t > 0;a > 2. 

Following Fan and Li (2001) we set a = 3.7. We should make it clear that 
when P\(t) is nonconcave, (3 should be understood as a good local maximizer 
of (1.2). See discussions in Section 2. 

The optimization problem in (1.2) is very challenging because of two ma- 
jor issues: (1) the number of unknown parameters is \K(K — 1), and hence 
the optimization problem is high dimensional in nature; and (2) the penalty 
function is concave and nondifferentiable at zero, although t c is a smooth 
concave function. We propose to combine the strengths of coordinate-ascent 
and minorization-maximization, which results in two new algorithms, CMA 
and LLA-CMA, for computing a local solution of the nonconcave penalized 
composite likelihood. See Section 2 for details. With the aid of the new al- 
gorithms, the SCAD penalized estimators are able to enjoy computational 
efficiency comparable to that of the LASSO penalized estimator. 

Fan and Li (2001) advocated the oracle properties of the nonconcave pe- 
nalized likelihood estimator in the sense that it performs as well as the 
oracle estimator which is the hypothetical maximum likelihood estimator 
knowing the true submodel. Zhang (2010a) and Lv and Fan (2009) were 
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among the first to study the concave penalized least-squares estimator with 
NP-dimensionality (p can grow faster than any polynomial function of n). 
Fan and Lv (2011) studied the asymptotic properties of nonconcave penal- 
ized likelihood for generalized linear models with NP-dimensionality. In this 
paper we show that the oracle model selection theory remains to hold nicely 
for nonconcave penalized composite likelihood with NP-dimensionality. Fur- 
thermore, we show that under certain regularity conditions the oracle esti- 
mator can be attained asymptotically via the LLA-CMA algorithm. 

There is some related work in the literature. Ravikumar, Wainwright 
and Lafferty (2010) viewed the Ising model as a binary Markov graph and 
used a neighborhood LASSO-penalized logistic regression algorithm to select 
the edges. Their idea is an extension of neighborhood selection by LASSO 
regression proposed by Meinshausen and Biihlmann (2006) for estimating 
Gaussian graphical models. Hoffing and Tibshirani (2009) suggested using 
the LASSO-penalized pseudo-likelihood to estimate binary Markov graphs. 
However, they did not provide any theoretical result nor application. In this 
paper we compare the LASSO and the SCAD penalized composite likelihood 
estimators and show the latter has substantial advantages with respect to 
both numerical and theoretical properties. 

The rest of this paper is organized as follows. In Section 2, we introduce 
the CMA and LLA-CMA algorithms. The statistical theory is presented in 
Section 3. Monte Carlo simulation results are shown in Section 4. In Section 5 
we present a real application of the proposed method to study the network 
structure of the amino-acid sequences of retroviral proteases using data from 
the Stanford HIV drug resistance database. Technical proofs are relegated 
to the Appendix. 

2. Computing algorithms. In this section we discuss how to efficiently 
implement the penalized composite likelihood estimators. As mentioned be- 
fore, the computational challenges come from (1) penalizing the concave 
composite likelihood with a nonconcave penalty which is not differentiable 
at zero; (2) the intrinsically high dimension of the unknown parameters. Zou 
and Li (2008) proposed the local linear approximation (LLA) algorithm to 
derive an iterative l\ -optimization procedure for computing nonconcave pe- 
nalized estimators. The basic idea behind LLA is the minorization-maximiza- 
tion principle [Lange, Hunter and Yang (2000), Hunter and Lange (2004), 
Hunter and Li (2005)]. Coordinate-ascent (or descent) algorithms [Tseng 
(1988)] have been successfully used for solving penalized estimators with 
LASSO-type penalties; see, for example, Fu (1998), Daubechies, Defrise and 
De Mol (2004), Genkin, Lewis and Madigan (2007), Yuan and Lin (2006), 
Meier, van de Geer and Biihlmann (2008), Wu and Lange (2008) and Fried- 
man, Hastie and Tibshirani (2010). In this paper we combine the strengths 
of minorization-maximization and coordinatewise optimization to overcome 
the computational challenges. 
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2.1. The CM A algorithm. Let (3 be the current estimate. The coordinate- 
ascent algorithm sequentially updates fyj by solving the following univariate 
optimization problem: 

(2.1) (3 jk ^argmax{4(/3 ifc ;/3^ =^,(/,fc') ¥= (j,k)) - P x (\(3 jk \)}. 

Pit. 

However, we do not have a closed-form solution for the maximizer of (2.1). 
The exact maximization has to be conducted by some numerical optimiza- 
tion routine, which may not be a good choice in the coordinate-ascent algo- 
rithm because the maximization routine needs to be repeated many times to 
reach convergence. On the other hand, one can find an update to increase, 
rather than maximize, the objective function in (2.1), maintaining the cru- 
cial ascent property of the coordinate-ascent algorithm. This idea is in line 
with the generalized EM algorithm [Dempster, Laird and Rubin (1977)] in 
which one seeks to increase the expected log likelihood in the M-step. 
First, we observe that for any ftij 

( 2 - 2 ) = £(**»(i - + Mi - M) > -\- 

jk n=i 

Thus, by Taylor's expansion, we have 

4(/3jfc; Pfk> = Pj>k', (f, k') / (j, fc)) > QiPjk), 



where 
(2.3) 



Q(Pjk) = ?c(/3jk = Pjk; Pyw = Pj'k', if, k') / (j, k)) 



(2.4) z 3 



jk 



dim 



■jr^x kn x jn (2 - 6 kn (/3) -9 jn 0)). 



d(3 jk 

Next, Zou and Li (2008) showed that 

(2.5) P x (\p jk \) < P X {\P jk \) + P' x {\P jk \) • fl&fcl - l^-fcl) = L(\p jk \). 

Combining (2.3)-(2.5) we see that Q((3j k ) — L(\f3j k \) is a minorization func- 
tion of the objective function in (2.1). We update f3j k by 

(2.6) = argmax{Q(/3 jfc ) - L(\f3 jk \)}, 

whose solution is given by /3^ w = S(j3 jk + 2z jk ,2P' x (\/3 jk \)) where S(r,t) = 
sgn(r)(|r| — 1)+ denotes the soft-thresholding operator [Tibshirani (1996)]. 
The above arguments lead to Algorithm 1 below, which we call the coordinate- 
minorization-ascent (CMA) algorithm. 
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Algorithm 1 The CMA algorithm 

(1) Initialization of (3. 

(2) Cyclic coordinate-minorization-ascent: sequentially update (1 < 
j<k<K) via soft-thresholding j3 jk <= S0 jk + 2z jk , 2P' x (\(3 jk \)). 

(3) Repeat the above cycle till convergence. 



Remark 1 . It is easy to prove that Algorithm 1 has a nice ascent prop- 
erty which is a direct consequence of the minorization-maximizaton prin- 
ciple. Note that Algorithm 1 can be directly used to compute the LASSO- 
penalized composite likelihood estimator. We simply modify the coordinate- 
wise updating formula as [3j k <t= S(f3j k + 2zj k , 2A). 

In practice we need to specify the A value. BIC has been shown to per- 
form very well for selecting the tuning parameter of the penalized likelihood 
estimator [Wang, Li and Tsai (2007)]. The BIC score is defined as 



BIC is used to tune all methods considered in this work. We use SCAD1 to 
denote the SCAD solution computed by Algorithm 1 with the BIC tuned 
LASSO solution being the starting value. 

For computational efficiency considerations, we implement Algorithm 1 
by using the path- following idea and some other tricks, including warm- 
starts and active-set-cycling [Friedman, Hastie and Tibshirani (2010)]. We 
have implemented the algorithm in R language functions. The core cyclic 
coordinate- wise soft-thresholding operations were carried out in C. 

Remark 2. As suggested by a referee, the coordinate-gradient-ascent 
(CGA) algorithm is a natural alternative to Algorithm 1 for solving the 
LASSO-penalized composite likelihood estimator. The CGA algorithm has 
successfully used to solve other penalized models. See Genkin, Lewis and 
Madigan (2007), Meier, van de Geer and Biihlmann (2008), Stadler, Biihl- 
mann and van de Geer (2010) and Schelldorfer, Biihlmann and van de Geer 
(2011). In the CGA algorithm we need to find a good step size along the 
gradient direction to guarantee the ascent property after each coordinate- 
wise update. These extra computations are necessary for the CGA algorithm, 
but are not needed in the CMA algorithm. We have also implemented the 
CGA algorithm to solve the LASSO estimator and found that the CMA 
algorithm is about five times faster than the CGA algorithm. See Section 4 
for the timing comparison details. 

2.2. Issues of local solution and the LLA-CMA algorithm. The objective 
function in (1.2) is generally nonconcave if a nonconcave penalty function is 



(2.7) 
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used. Using Algorithm 1 we find a local solution to (1.2), but there is no guar- 
antee that it is the global solution. A similar case is Schelldorfer, Buhlmann 
and van de Geer (2011) where the objective function is the LASSO-penalized 
maximum likelihood of a high-dimensional linear mixed-effects model, and 
the authors derived a coordinate-wise gradient descent algorithm to find 
a local solution. 

It should not be considered as a special weakness of Algorithm 1 or other 
coordinate-wise descent algorithm as in Schelldorfer, Buhlmann and van de 
Geer (2011) that the algorithm can only find a local solution, because in 
the current literature there is no algorithm that can guarantee to find the 
global solution of nonconcave maximization (or nonconvex minimization) 
problems, especially when the dimension is huge. Consider, for example, the 
EM algorithm, which is perhaps the most famous algorithm in statistical lit- 
erature. The EM algorithm often offers an elegant way to fit some statistical 
models that are formulated as nonconcave maximization problems. However, 
the EM algorithm provides a local solution in general. A recent application 
of the EM algorithm to high-dimensional modeling can be found in Stadler, 
Buhlmann and van de Geer (2010) who considered a LASSO-penalized maxi- 
mum likelihood estimator of a high-dimensional linear regression model with 
inhomogeneous errors that are modeled by a finite mixture of Gaussians. To 
handle the computational challenges in their problem, Stadler, Buhlmann 
and van de Geer (2010) proposed a generalized EM algorithm in which a co- 
ordinate descent loop is used in the M-step and showed that the obtained 
solution is a local solution. 

Our numerical results show that in the penalized composite likelihood 
estimation problem the SCAD performs much better than the LASSO. To 
offer theoretical understanding of their differences, it is important to show 
that the obtained local solution of the SCAD-penalized likelihood has bet- 
ter theoretical properties than the LASSO estimator. In Section 3 we estab- 
lish the asymptotic properties of the LASSO estimator and a local solution 
of (1.2) with the SCAD penalty. However, a general technical difficulty in 
nonconcave maximization problems is to show that the computed local solu- 
tion is the one local solution with proven theoretical properties. In Stadler, 
Buhlmann and van de Geer (2010) and Schelldorfer, Buhlmann and van de 
Geer (2011), nice asymptotic properties are established for their proposed 
methods but it is not clear whether the computed local solutions could have 
those theoretical properties. The same issue exists in Fan and Lv (2011). 

To circumvent the technical difficulty, we can consider combining the LLA 
idea [Zou and Li (2008)] and Algorithm 1 to solve (1.2) with a nonconcave 
penalty. The LLA algorithm turns a nonconcave penalization problem into 
a sequence of weighted LASSO penalization problems. Similar ideas of it- 
erative LLA convex relaxation have been used in Candes, Wakin and Boyd 
(2008), Zhang (2010b) and Bradic, Fan and Wang (2011). Applying the LLA 
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Algorithm 2 The LLA-CMA algorithm 

(1) Initialize /3 {0) , and compute Wjk = P' x (\^\). 

(2) For m = 0, 1, 2, 3, . . . , repeat the LLA iteration: 

(2. a) Use Algorithm 1 to solve /3 (m+1) defined in (2.8); 
(2.b) Update the weights w jk by P' X {\P { % +1) \) ■ 



algorithm to (1.2), we need to iteratively solve 

{K K 
j=l k=j+l 

for m = 0, 1, 2, . . . where Wjk = P\{\$]^\)- Note that Algorithm 1 can be used 
to solve (2.8) by simply modifying the coordinate-wise updating formula 
as Pjk •<= S{f3jk + 2zjk, 2wjk)- Therefore, we have the following LLA-CMA 
algorithm for computing a local solution of (1.2). 

In Section 3 we show that if the LASSO estimator is then under 

certain regularity conditions the LLA-CMA algorithm finds the oracle esti- 
mator with high probability. These results suggest that we should take the 
following steps to compute the SCAD solution by the LLA-CMA algorithm. 

The proposed LLA-CMA procedure for computing a SCAD estimator: 

Step 1. Use Algorithm 1 to compute the LASSO solution path and find 
the LASSO estimator by BIC. 

Step 2. Use the LASSO estimator as /3 (0) in the LLA-CMA algorithm to 
compute the solution path of the first iteration and use BIC to tune the first 
step solution. Then use the tuned first step solution as (3^ in the LLA- 
CMA algorithm to compute the solution path and use BIC to select A. The 
resulting estimator is denoted by SCAD2. 

Step 3. For the chosen A of SCAD2, use Algorithm 2 to compute the fully 
converged SCAD solution with SCAD2 being the starting value. Denote this 
SCAD solution by SCAD2**. 

The construction of SCAD2 follows an idea in Buhlmann and Meier 
(2008). Based on our experience, SCAD2** works slightly better than SCAD2, 
but the two are generally very close. Generally we recommend using SCAD2** 
in real applications. 

3. Theoretical results. In this section we establish the statistical theory 
for the penalized composite conditional likelihood estimator using the SCAD 
and the LASSO penalty, respectively. Such results allow us to compare the 
SCAD and the LASSO estimators theoretically. 
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In order to present the theory we need some necessary notation. For a ma- 
trix A = (ciij), we define the following matrix norms: the Frobenius norm 



\\A\\p = ^Y,ijtfj, the entry-wise norm ||A|| max = maxjj \a,ij\ and the 

matrix norm = maxj \aij\. Let (3* = {fi* k '-j < k} denote the 

true coefficients, A = {(j,k) : (3* k ^ 0, j < k} and s = \A\. Define p(s,N) = 

min(j jfc ) g _4 \/3j k \ which represents the weakness of the signal. Let H be the 
Hessian matrix of £ c such that 

d 2 t c ((3) 



0'lfcl)>0'2fc 2 ) 



d/3 jlkl d/3 j2 k 2 

1 < ji < h < K and 1 < j 2 < k2 < K. For simplicity we use H* = H{(3*). We 
partition H and (3 according to A as h^^ c ) an< ^ P = {0Ai Pa c ) 7 

respectively. We let 

X_4 = (Xj : (j, k) or (k,j) £ A for some k) 

and 

*4n = ( x jn : (j, fc) or j) £ .4 for some fc). 
Finally, we define 

b = A min (£'[^_ 4 ]), 

B = A max (E'[X_4X^]), 

<P=\\E[H* ACA ](E[H* AA ])- 1 \\ 00 . 

Define the oracle estimator as /3 oraclc = (/3^ mle ,0) where 

^ mle = argmax4((/3^,0)). 
Pa 

If we knew the true submodel, then we would use the oracle estimator to 
estimate the Ising model. 

Theorem 3.1. Consider the SCAD -"penalized composite likelihood de- 
fined in (1.2). We have the following two conclusions: 

(1) For any R < -j^^- , we have 
(3.1) Pr(||3r C -^ll 2 < 




n 



with n =exp(-R 2 w) + 2s 2 exp(-f % ) + 2s 2 exp(-f f ). 

(2) Pick a A satisfying A < min(^|^, ^jg^ )• With probability at least 
1 — T 2; /3 oraclc is a local maximizer of the SCAD -penalized composite likeli- 
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hood estimator where 

( 2^ 2 \ 
t 2 = exp I -R* ^ ) + K exp 



iVA 2 



(3.2) 



+ exp 



+ 4s^ 



NX 



3B(2cf) + I)s8 3 



32(20 + 1) 
+ if 2 s exp 



Nb 2 
2s 3 " 



exp 



Nb 2 



s 2 2 



-qTt + ex P -To^r 



2\ 1 



s 2 8 



and i?* = min(± Jf p(s, TV), 



+ 2s exp 



b 2 N 
"8s 3 " 



We also analyzed the theoretical properties of the LASSO estimator. If 
the LASSO can consistently select the true model, it must equal to the 
hypothetical LASSO estimator ((3^,0) where 

3^ = argmax|4((/3 -A ,0)) -A ^ 



U,k)eA 



Theorem 3.2. Consider the LASSO-penalized composite likelihood es- 
timator. 

(1) Choose A such that As < §§. Pr(||/3^ - P* A h < ^¥^) ^ 1 ~ T 'i with 



T > 1 = e -NX*/2 + 2s 2 



exp 



-Nb< 
~2^~ 



+ exp 



-NB' 



(2) Assume the ir-representable condition cfi < 1 — rj < 1. Choose A suc/i 
i/iai As < min( ^g j^j, §§)■ T/ien (/3^,0) is i/ie LASSO-penalized compos- 
ite likelihood estimator with probability at least 1 — t' 2 , where 



T >- e -NX-/2 + K 2 s 



exp 



Nb 2 V 2 . ^ 2 
-£3-)+* exp 



iVA^r/ 



2^,2 



( 



+ 2s' 



exp 



2^2 



2s 3 (2-7/) 2 



+ exp 



2s 2 



+ exp 



Js^~ 



In Theorems 3.1 and 3.2 the three quantities b, B and <j) do not need 
to be constants. We can obtain a more straightforward understanding of 
the properties of the penalized composite likelihood estimators by consider- 
ing the asymptotic consequences of these probability bounds. To highlight 
the main point, we consider b, B and <j> are fixed constants and derive the 
following asymptotic results. 

Corollary 3.1. Suppose that b, B and cp are fixed constants and fur- 
ther assume N > s 3 log(K) and p(s,N) > y / log ^f ) . 
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(1) Pick the SCAD penalty parameter A scad satisfying 



2a ' 3s,B /' V N 

With probability tending to 1, the oracle estimator is a local maximizer of 
the SCAD-penalized estimator and ||/3^ raclc -(3* A \\ 2 = O p (^/J). 

(2) Assume the ir-representable condition in Theorem 3.2. Pick the LASSO 
penalty parameter A lasso satisfying 

mir Y_L p ( S) AT), I ) » A lasso » 



s s) 

then the LASSO estimator consistently selects the true model and ||/3j| sso — 
/ 3^|| 2 = P (A lasso ^). 

Remark 3. For the LASSO-penalized least squares, it is now known 
that the model selection consistency critically depends on the ir-representable 
condition [Zhao and Yu (2006), Meinshausen and Biihlmann (2006), Zou 
(2006)]. A similar condition is again needed in the LASSO-penalized com- 
posite likelihood. Furthermore, Corollary 3.1 shows that even when it is pos- 
sible for the LASSO to achieve consistent selection, A lasso should be much 

greater than which means that A 1 ^ -^ 3> yfjj- So the LASSO yields 
larger bias than the SCAD. 

Remark 4. We have shown that asymptotically speaking the oracle 
estimator is in fact a local solution of the SCAD-penalized composite like- 
lihood model. This property is stronger than the oracle properties defined 
in Fan and Li (2001). Our result is the first to show that the oracle model 
selection theory holds nicely for nonconcave penalized composite conditional 
likelihood models with NP-dimensionality. The usual composite likelihood 
theory in the literature is only applied to the fixed-dimension setting. Our 
result fills a long-standing gap in the composite likelihood literature. 

What we have shown so far is the existence of a SCAD-penalized estimator 
that is superior to the LASSO-penalized estimator. Moreover, we would like 
to show that the computed SCAD estimator is equal to the oracle estimator. 
As discussed earlier in Section 2.2, such a result is very difficult to prove due 
to the nonconcavity of the penalized likelihood function. See also Fan and 
Lv (2011), Stadler, Biihlmann and van de Geer (2010) and Schelldorfer, 
Biihlmann and van de Geer (2011). 

If one can prove that the objective function has only one maximizer, then 
the computed solution and the theoretically proven solution must be the 
same. This idea has been used in Fan and Lv (2011) to study the noncon- 
cave penalized generalized linear models and Bradic, Fan and Jiang (2011) 
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to study the nonconcave penalized Cox proportional hazards models. Their 
arguments are based on the observation that the SCAD penalty function has 
a finite maximum concavity [Zhang (2010a), Lv and Fan (2009)]. Hence, if 
the smallest eigenvalue of the Hessian matrix of the negative log-likelihood 
is sufficiently large, the overall penalized likelihood function is concave and 
hence has a unique global maximizer. This argument requires that the sam- 
ple size is greater than the dimension; otherwise, the Hessian matrix does not 
have full rank. To deal with the high-dimensional case, Fan and Lv (2011) 
further refined their arguments by considering a subspace denoted by § s , 
which is the union of all s-dimensional coordinate subspaces. Under some 
regularity conditions, Fan and Lv (2011) showed that the oracle estimator 
is the unique global maximizer in S s , which was referred to as restricted 
global optimality. Then by assuming that the computed solution has ex- 
actly s nonzero elements, it can be concluded that the computed solution is 
in S s and hence equals the oracle estimator; see Proposition 3.b of Fan and 
Lv (2011). However, a fundamental problem with these arguments is that we 
have no idea whether the computed solution selects s nonzero coefficients, 
because s is unknown. 

Here we take a different route to tackle the local solution issue. Instead of 
trying to prove the uniqueness of maximizer, we directly analyze the local 
solution by the LLA-CMA algorithm and discuss under which regularity 
conditions the LLA-CMA algorithm can actually find the oracle estimator. 

Theorem 3.3. Consider the SCAD -'penalized composite likelihood esti- 
mator in (1.2). Let /3 scad be the local solution computed by Algorithm 2 (the 
LLA-CMA algorithm) with /3(°) being the initial value. Pick a X satisfying 
A < min(^), Write r = Pr(p(°) - /T |U > A). 

(1) The LLA-CMA algorithm finds the oracle estimator after one LLA 
iteration with probability at least 1 — tq — t% where 

~ ( -NX 2 \ ( -NX b 2 \ - f-Nb 2 ' 

T3 = K exp T) + exp — — — —~\+Ks exp 



+ 2s 2 



32(20 + 1)V H V35(2<A+1>8V a ~*\ 2*3 
Nb 2 \ ( Nb 2 \ ( N B 2 



exp( -8^J +exp r^Tj +exp l s 2 



(2) The LLA-CMA algorithm converges after two LLA iterations and / £} scad 
equals the oracle estimator with probability at least 1 — tq — ti, where T2 is 
defined in (3.2). 

Theorem 3.3 can be used to drive the following asymptotic result. 

Corollary 3.2. Suppose that b, B and 4> are fixed constants, and fur- 
ther assume N > s 3 log(ET) and p(s, N) > max W lo sW^v^/>>) ^ Q ons ^ er t ^ e 
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SCAD -penalized composite likelihood estimator with the SCAD penalty pa- 
rameter A scad satisfying 



2a ' 3sB J' \ N 

(1) // ro —7-0, i/ien probability tending to one, the LLA-CMA algo- 
rithm converges after two LLA iterations and the LLA-CMA solution (or 
its one-step version) is equal to the oracle estimator. 

(2) Consider using the LASSO estimator as . Assume the ir-represent- 
able condition in Theorem 3.2, and pick the LASSO penalty parameter A lasso 
satisfying 

« A lasso « min (^=p(s, N), - 



^lasso ^ 



\ scad t 
i lasso A 



, s 16 

Then tq — > 0, and the conclusion in (1) holds. 

Remark 5. Part (1) of Corollary 3.2 basically says that any estima- 
tor that converges to (3* in probability at a rate faster than A scad can be 
used as the starting value in the LLA-CMA algorithm to find the oracle 
estimator with high probability. Note that such a condition is not very re- 
strictive. Part (2) of Corollary 3.2 shows that the LASSO estimator satisfies 
that condition. We could also consider using other estimators as the start- 
ing value in the LLA-CMA algorithm. For example, we can use the neigh- 
borhood selection estimator as f3^ . Following Ravikumar, Wainwright and 
Lafferty (2010) we assume an ir-representable condition for each of the K 
neighborhood LASSO-penalized logistic regression and some other regular- 
ity conditions. Then it is not hard to show that the neighborhood selection 
estimator is also a qualified starting value. In this work, we would like to 
faithfully follow the composite likelihood idea and hence prefer to use the 
LASSO-penalized composite likelihood estimator as the starting value in the 
LLA-CMA algorithm. 

4. Simulation. In this section we use simulation to study the finite sam- 
ple performance of the SCAD-penalized composite likelihood estimator. For 
comparison, we also include other two methods: neighborhood selection by 
LASSO-penalized logistic regression [Ravikumar, Wainwright and Lafferty 
(2010)] and the LASSO-penalized composite likelihood estimator. 

For each coupling coefficient pj^, the LASSO-penalized logistic method 

provides two estimates: fij^k based on the model for the jth dipole and flk^>j 
based on the model for the fcth dipole. Then we carry out two types of neigh- 
borhood selections: (i) aggregation by intersection (NSAI) based on /3^ SAI , 
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and (ii) aggregation by union (NSAU) based on /3^ SAU , where 



oNSAI 
Pjk 



>rk + Pkl-^j 



and 



/? fc SAU 



0. 



>k + Ph- 



if P jy 

if % 



if Pj^kPk^j = 

otherwise, 

¥k = and 
> k ^ and 
> fc = and 



= 0, 
= 0, 
^0, 



if Pj^kPk^i / 0. 



As suggested by a referee, the relaxed LASSO [Meinshausen (2007)] was 
used in neighborhood selection to try to improve its estimation accuracy. In 
each neighborhood logistic regression model, we first found a subset model by 
using the LASSO-penalized logistic regression. We re-estimated the nonzero 
coefficients via the unpenalized logistic regression on the subset model. 

BIC has been shown to perform very well for selecting the tuning parame- 
ter of the penalized likelihood estimator [Wang, Li and Tsai (2007), Stadler, 
Biihlmann and van de Geer (2010), Schelldorfer, Biihlmann and van de Geer 
(2011)]. We used BIC to tune all competitors. 

Two sparse Ising models were considered in our simulation. Their graphi- 
cal structure is displayed in Figure 1 where solid dots represent the dipoles, 
and two dipoles are connected if and only if their coupling coefficient is 
nonzero. We generated the nonzero coupling coefficients as follows. If dipoles i 



and j are connected, we let (3{j be tijS tJ 



where tij 



lowing the uniform distribution on [1,2] and Si 



is a random variable fol- 
is a Bernoulli variable with 
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Table 1 

Comparing different estimators using simulation models 1 and 2 with standard errors in 
the bracket. NSAI-relax and NSAU-relax mean that we use the relaxed LASSO to 
re-estimate the nonzero coefficients chosen by neighborhood selection method 







Model 1 






Model 2 




MSE 


NDE 


FDR 


MSE 


NDE 


FDR 


NSAI 


22.96 


138.9 


0.09 


8.16 


26.8 


0.16 




(0 1 81 


Cn 41 


(ft 01 1 


(ft 191 


(0.2) 


(0.01) 


NSAU 


17.34 


197.3 


0.36 


6.38 


39.7 


0.39 




(0.14) 


(1.0) 


(0.01) 


(0.16) 


(0.5) 


(0.01) 


LASSO 


21.33 


332.5 


0.62 


12.19 


117.1 


0.79 




(0.13) 


(3.8) 


(0.04) 


(0.12) 


(3.0) 


(0.05) 


SCAD1 


2.86 


145.0 


0.12 


5.64 


30.0 


0.22 




(0.10) 


(2.4) 


(0.01) 


(0.17) 


(1.8) 


(0.02) 


SCAD2 


2.43 


129.2 


0.07 


4.41 


26.1 


0.17 




(0.05) 


(0.5) 


(0.01) 


(0.13) 


(0.7) 


(0.02) 


SCAD2" 


2.42 


128.6 


0.06 


4.39 


25.7 


0.16 




(0.05) 


(0.5) 


(0.01) 


(0.13) 


(0.6) 


(0.02) 


NSAI-relax 


8.23 


138.9 


0.09 


6.34 


26.8 


0.16 




(0.13) 


(0.4) 


(0.01) 


(0.09) 


(0.2) 


(0.01) 


NSAU-relax 


4.44 


197.3 


0.36 


5.67 


39.7 


0.39 




(0.10) 


(0.4) 


(0.01) 


(0.10) 


(0.5) 


(0.01) 


Pr( Sij = 1) 


= Pr(sij = - 


■1) = 0.5. 


For each 


model, we 


used Gibbs sampling 



to generate 100 independent datasets consisting 300 observations. For com- 
parison, we use three measurements: the total number of discovered edges 
(NDE), the false discovery rate (FDR) and mean square errors (MSE). 
Based on Table 1, we make the following interesting observations: 

• NSAU, while selecting larger models than NSAI, provides more accurate 
estimation. Neighborhood selection outperforms the LASSO-penalized com- 
posite likelihood estimator. 

• Note that SCAD2** has the smallest MSE in both models. SCAD2** and 
SCAD2 gave almost identical results, and their improvement over SCAD1 
is statistically significant. All three SCAD solutions perform much better 
than the LASSO for fitting penalized composite likelihood in terms of 
estimation and selection. 

• The SCAD solutions and NSAI have similar model selection performance, 
but the SCAD is substantial better in estimation. Using the relaxed LASSO 
can improve the estimation accuracy of neighborhood selection methods, 
but their improved MSEs are still significantly higher than those of SCAD2 
and SCAD2**. 

In Table 2 we compare the run times of the three methods. LASSO-CGA 
denotes the coordinate gradient ascent algorithm for computing the LASSO 
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Table 2 

Total time (in seconds) for computing solutions at 100 penalization parameters, averaged 
over 3 replications. Timing was carried out on a laptop with an Intel Core 1.60 GHz 
processor. LASSO-CGA denotes a coordinate gradient ascent algorithm for computing 
the LASSO-penalized composite likelihood estimator. The timing of SCAD1, SCAD2 and 
SCAD2** includes the timing for computing the starting value 



(N,p) 


Neighborhood 
selection 


LASSO 


SCAD1 


SCAD2 


SCAD2** 


LASSO-CGA 


Model 1 
(300, 7875) 


51.1 


32.7 


67.9 


84.7 


95.1 


179.8 


Model 2 
(300,5356) 


29.8 


16.0 


34.8 


42.6 


51.2 


89.6 



estimator. The computing time is about five times longer than that used by 
the CMA algorithm. Compared to the LASSO case, the run time for fitting 
the SCAD model is doubled or tripled, but it is still very manageable for 
the high-dimensional data. 

5. Stanford HIV drug resistance data. We also illustrate our methods 
in a real example using a HIV antiretroviral therapy (ART) susceptibility 
dataset obtained from the Stanford HIV drug resistance database. Details 
of the database and related data sets can be found in Rhee et al. (2006). 
The data for analysis consists of virus mutation information at 99 protease 
residues (sites) for N = 702 isolates from the plasma of HIV-l-infected pa- 
tients. This dataset has been previously used in Rhee et al. (2006) and Wu, 
Cai and Lin (2010) to study the association between protease mutations and 
susceptibility to ART drugs. 

A well recognized problem with current ART treatment such as Pis for 
treating HIV is that individuals who initially respond to therapy may de- 
velop resistance to it due to viral mutations. HIV-1 protease plays a key role 
in the late stage of viral replication and its ability to rapidly acquire a variety 
of mutations in response to various Pis confers the enzyme with high resis- 
tance to ARTs. A high cooperativity has been observed among drug-resistant 
mutations in HIV-1 protease [Ohtaka, Schon and Freire (2003)]. The se- 
quence data retrieved from treated patients is likely to include mutations 
that reflect cooperative effects originating from late functional constraints, 
rather than stochastic evolutionary noise [Atchley et al. (2000)]. However, 
the molecular mechanisms of drug resistance is yet to be elucidated. It is 
thus of great interest to study inter-residue couplings which might be rele- 
vant to protein structure or function and thus could potentially shed light 
on the mechanisms of drug resistance. We apply the proposed method to 
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Table 3 

Application to HIVRT data. NSE is the number of "stable edges. " E[V] is the expected 
number of falsely selected edges. Its upper bounds were computed by Theorem 1 in 
Meinshausen and Biihlmann (2010) 





NSAI 


NSAU 


LASSO 


SCAD1 


SCAD2 


SCAD2** 


NDE 


57 


305 


631 


101 


141 


132 


ME 


26.38 


36.34 


18.35 


18.30 


16.76 


16.74 






Stability selection 








NSE (7r thr = 0.9) 


15 


63 


160 


17 


20 


20 


E[V] 


< 3.2 


<48 


< 147.5 


<4.3 


<8.0 


<7.2 















the protease sequence data to investigate such inter-residue contacts. Our 
analysis only included K = 79 of the 99 residues that contain mutations. 

We split the data into a training set with 500 data and a test set with 
202 data. Model fitting and selection were done on the training set and the 
test data were used to compare the model errors. For a given estimate (3 
obtained from the training set, its model error is gauged by the value of 
composite likelihood evaluated on the test set, that is, 

1 202 79 

me(3) = -^ c est (3) = -202 ££log(M3))- 

n=l j'=l 

We report the analysis results in Table 3. There are total 3081 coupling 
coefficients to be estimated. Graphical presentations of the selected models 
are shown in Figure 2. Note that SCAD2 and SCAD2** again gave almost 
identical results and performed better SCAD1. We also performed stability 
selection [Meinshausen and Biihlmann (2010)] on each method to find "sta- 
ble edges." A remarkable property of stability selection is that under some 
suitable conditions stability selection achieves finite sample control over the 
expected number of false discoveries in the set of "stable edges." We use the 
SCAD selector to explain the stability selection procedure. We took a ran- 
dom subsample of size 250 and fitted the SCAD model. The process was 
repeated 100 times. On average, SCAD1 selected 103.1 edges, SCAD2 se- 
lected 140.7 edges and SCAD2** chose 133.4 edges. For each coefficient f3 jk 

we computed its frequency of being selected, denoted by Hjk- The set of 
"stable edges" is defined as {(k,j) :Hkj > ^thr}- In Table 3, we report the 
results using the threshold 7r t h r = 0.9, as suggested by Meinshausen and 
Biihlmann (2010). Stability selection found 17 edges in the SCAD1. SCAD2 
and SCAD2** selected the same 20 stable edges. By Theorem 1 in Mein- 
shausen and Biihlmann (2010), among these 17 stable edges selected by 
SCAD1, the expected number of false discoveries is no greater than 4.3, and 



18 



L. XUE, H. ZOU AND T. CAI 




(Al) The LASSO model. (A2) "Stable edges" in the LASSO 



model. 




(Bl) The SCAD2** model. (B2) "Stable edges" in the SCAD2** 

model. 




(CI) The model by neighborhood (C2) "Stable edges" in neighborhood 
selection. selection. 

Fig. 2. Shown in the left three panels (Al), (Bl), (CI) are the selected models by BIG 
The right three panels (A2), (B2), (C2) show the stability selection results using n t ^ r = 0.9. 

among the 20 stable edges selected by SCAD2 or SCAD2**, the expected 
number of false discoveries is at most 7.2. Likewise, we did stability selection 
with the LASSO selector and neighborhood selection, and the results are re- 
ported in Table 3 as well. Figure 2 shows the "stable edges" by stability 
selection. We see that the computed upper bounds are very useful for the 
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SCAD selector and NSAI and not so informative for the LASSO selector and 
NSAU. Interestingly, both NSAI and SCAD suggest there are about 12 true 
discoveries by stability selection. In fact, we found that NSAI and SCAD1 
have 11 "stable edges" in common, and NSAI and SCAD2 (or SCAD2**) 
have 12 "stable edges" in common. 

These results are consistent with some of the previous findings. For ex- 
ample, it has long been known that co-substitutions at residues 30 and 88 
are most effective in reducing the susceptibility of nelfinavir [Liu, Eyal and 
Bahar (2008)]. Among the top 30 most common drug resistance mutations 
[Rhee et al. (2004)], 7 of those had a joint mutation at residues 54 and 82, 
the joint mutation at residues 88 and 30 was the second most common muta- 
tion among all. A co-mutation at residues 54, 82 and 90 was associated with 
high resistance to multiple drugs and an additional co-mutation at 46 was 
associated with an even higher level of resistance. It is interesting to note 
that using a larger set of isolates from treated HIV patients, Wu et al. (2003) 
reported (54, 82), (32, 47), (73, 90) as the three most highly correlated pairs. 
All these three pairs showed up as the stable edges in our analysis. Mutation 
at residue 71, often described as a compensatory or accessory mutation, has 
been reported as a critical mutation which appears to improve virus growth 
and contribute to resistance phenotype [Markowitz et al. (1995), Tisdale 
et al. (1995), Muzammil, Ross and Freire (2003)]. Accessory mutations con- 
tribute to resistance only when present with a mutation in the substrate 
cleft or flap or at residue 90 [Wu et al. (2003)]. The stable edges connect 
this accessory mutation with residues 90 and 54 (a flap residue), as well as 
with another flap residue at 46 through residue 10. 

APPENDIX: TECHNICAL PROOFS 

Before presenting the proof, we first define some useful quantities. The 
score functions of the negative composite likelihood (—A?)) and the Hessian 
matrices are defined as follows: 

u) 8£^(J3^) 1 ^ 

i n=l 

HklM = ~d(3 jkl d(3 jk2 = I^ SfelflXto(1 " e ^ n ' hM^J- 

Similarly, let tp be the score function of — £ c such that ip(jk) = 9 ^ or ^ — 

(i) (k) 

j < k < K . By definition we have the following identities: ip(jk) = Wk 
In what follows we write ijj* =i/j((3*). 



Proof of Theorem 3.1. We first prove part (1). 
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Consider V(a A ) = —£ c (/3 A + d]sfOi A )+£ c (/3 A ) and its minimizer is a A mlc = 
^L^hmic _ p* A y By definition, V"(S^ rnlc ) < V(0) = 0. Fix any R > and 
consider any a A satisfying 1 1 ck^4 1 1 2 — R- Using Taylor's expansion, we know 
that, for some t £ [0, 1] and (3(t) = [3 A + td^cxj^, 

V(a A ) = d N a A ip A + \d 2 N cx A H AA OL A 

(A.l) + \d 2 N cx T A [H AA (P(t)) ~ H\ A ]cx A 

= T 1 + T 2 + T Z . 

Note that E[^)* A ] = and Halloo < 2. By the Cauchy-Schwarz inequality, 
l a S^I — 2\/si2. Using HoefFding's inequality, we have 

(A.2) 



/ Ne 2 \ 

Pr^-e^expf-^j. 



For the second term, we first have T 2 > ?f\ min (H AA )R 2 . Each entry of H* 
is between — | and \. Thus HoefFding's inequality and the union bound yield 

So by the inequality A m i n (/T^) > b — \\H AA — E[H aa ]\\f, we have 

Nb 2s 



Pr \\H. 



(A.3) 



Pr(T 2 > djfbBr/4) >l-2s 2 exp 



2s 2 



For \T 3 \, let A max (^ £) n=1 x^ n x^J = B N . Define T?jn(/3) = 0j n (l-0j n )(20j n - 
1). Using the mean value theorem, we have that, for some t' G [0,t] and 
/3(t / ) = /3^ + t / ^a^, 



r/ 3 



1 K ( 



(A.4) 



< 



(j,fc)e.4 

In the last step we have used |^j n ,(/3(t'))| < | for any j and a^c = 0. More- 
over, B N < B + \\jjJ2n=i*An* An ~ E[xa*a\\\f- Since x jn = ±l, we apply 
HoefFding's inequality and the union bound to obtain the following proba- 
bility bound: 



Pr 



( 1 N 

V n=l 



>B/2 < 2s 2 exp 
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which leads to 

(A.5) Pr(|T 3 |<^^ 3 ) >l-2, 2 exp(-^). 

Taking R<^^- and combining (A.2) (A.3) and (A.5), we have 

T 1 +T 2 + T z > ^-d 2 N - ^-R 3 d%^ > 
o o 

with probability at least 1 — t\. Thus, the convexity of V implies that 



Pr(||/3r c -^ll 2 <^)>l-ri. 



We now prove part (2). First, we show that if vam^^^ 1/3^™ | > a\ and 
||^(3° racle )l|oo < A, then 3 oraclc is a local maximizer of 4 (/3) -Ey.fc) p x(\M 
To see that, consider a small ball of radius t with / g oraclc being the center. 
Let /3 be any point in the ball. So ||/3 — /3 oracle ||2 < t. Clearly, for a sufficiently 
small t we have min^ .fc) g _4 \Pjk\ > aX and max^^g^c \Pjk\ < A. By Taylor's 
expansion we have 

im + e p ^A)\ - |-4(3 oraclc ) + E p A(i^r lc i) 

(7,*) 0,*) 

= - 3 hmle ) T ^(3 oraclc ) + 3 oracl0 ) T ^(/a / )(/3 - 3 oracle ) 

+ E A i^i 
> E (A-i^ fc) (3 oraclc )i)i^i>o. 

(j,fc)e.4 c 



A probability bound for the event of min^^g^ |/3^™ | > aX is given by 
nin |^ mlc | > aX] 



(A-6) >Pr(||3!T le -^ll2<y^^ 



/ 9 6 2 \ 2 / iV& 2 \ 2 / NB 2 
>l_ exp ^-^_j -2s exp^-^-)-2s exp^--^ — 

Now consider Pr(|| , t/ ; .4 c (/3 oraclc )||oo < A). There exists some t 6 [0, 1] such that 
(A.7) -0(3 oraclc ) = ip(/3*) + H*0 oiaclc - (3*) + r, 
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where r = {H((3* + i(3° raclc -/3*)) - H*)0 oiaclc -(3*). Note Vu(3° racl °) 
so 

^-^ = (iJ^)- 1 (-^- M ). 
Then ||^ c (3° raclc )||oo < A becomes 

II^m(#X4) _1 (-Vu-m) + Vu= +m c IIoo < a, 

which is guaranteed if 

(iiirwc^Oir'iioc + iximu + ikiioo) < a. 

Therefore we have a simple lower bound for Pr(||'0 v 4c(/3 oraclc )|| oo < A). 
Pr(||^(3° raclc )ll 00 <A) 

A 



0, 



> 1 - Pr(||^ cA (F^)- 1 || 00 > 20) -Prf 



> 



40 + 2 



Pr Iklloo > 



A 



' + 2 

Using Hoeffding's inequality and the union bound, we have 



(A. 



Pr 



< 



A 



> 1 - K 2 exp 



NX 2 



128(0 + 1/2) 5 



have a bound for rujA : 



Write a = /3 hmlc — (3*, and thus aj\c = 0. By the mean value theorem, we 
rr Uk y. 

1 N 

~77 ^ ] ^ ] ^ ] x knXjnXk2n^k'n < ^jk2 < ^jk , t Vjn(fi(t )) 

1 * 

~l 77 ^ ] ^ ^ ^ ^ Xj n Xf :n Xj 2n Xji n ai i: j 2 CXl i: jit Tjkn^Pft )) 



TV 

n=lj 2 ¥=kj'^k 

<B N .\\p A -f3* A \\ 2 2 . 

In the last step we have used \ fjj n ((3(t'))\ < | for any j and a A c = 0. More- 
over, recall that 



B N <B + 



1 N 



n=l 



Thus 



(A.9) 



Pr Iklloo < 



+ 2 



> 1 — exp 



2 s exp 



—NX b 2 
35(20 + l)s 83 

NB 2 



2s 2 exp ( 



V 2s 2 



8s 2 
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For notation convenience define c = IK-Ef-fr^J) -1 !^ — %/*IIC^[-^Cia]) lb 
and 

<5 = \\Ha c a( H Aa) 1 - E [H A c A ](E[H AA ]) W001 
h = \\{H* AA )- 1 -{E[H* AA ]y 1 \\ 00 , 
S 2 = \\H AA -E[H AA }\\ 00 , 

^3 = \\H% A - ^I^jW] Hoc- 
Then by definition 

5 = \\{H% A - E[H% A ]){{H\ A r l - (ElH^y 1 ) 

+ E[H* A c A ](E[H AA \r\-H AA + E[H* AA ]){H AA )- 1 

+ ( H %A ~ E[H A c A })(E[H AA }) 1 || 00 
<6 3 6 1 + <j>6 2 \\(H AA )- 1 \\ 00 + 5 3 c 
< 5 3 5! + <f)(c + 6 1 )6 2 + 5 3 c. 

Note that 

Si = \\{H* AA r\E[H AA ] - H* AA ){E[H AA \r l \ 



00 



< ik^a)" 1 ^ • \\e[h*aa] - h*aa\l ■ urn^y'i 

< (5i+c)5 2 c. 

Hence as long as 5 2 c < 1 we have 5i < pgj - and 5 < (5 3 + 4>5 2 )—_ 



8 2 c 



(A.10) 



h < j-) > 1 " Pr(\\H A cA ~ E[H* AcA }\\ max > 



> 1 - 2s 2 exp 



TV 



8c 2 s 2 / ' 



PrU 3 < f c J > 1 - PrI Hfl^ " E[H^a]\U, > ^ 

(A.ll) 



1 - JCaflffltP (-^)- 



Finally we have c < ^/b. Therefore, part (2) is proven by combining (A. 6), 
(A.8) (A.9) and (A.10), (A.ll). This completes the proof. □ 

Proof of Theorem 3.2. The proof is relegated to a supplementary 
file [Xue, Zou and Cai (2010)] for the sake of space. □ 

Proof of Corollary 3.1. It follows directly from Theorems 3.1 
and 3.2; thus we omit its proof here. □ 
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PROOF of Theorem 3.3. Under the event ||/3^ - /3*||t» < A, we have 
\pfk\ < A for (j, k) e A c and \pf^\ > aX for (j, k) G A. Therefore, is the 
solution of the following penalized composite likelihood: 

(A.12) 3«=argmaxj4(/3)-A E \(3 jk \\. 

It turns out that / g oracle is the global solution of (A.12) under the additional 
probability event that {||^(/3 oraclc )||oo < A}. To see this, we observe that 
for any /3, 

(-403)+ a E i^i)-(-4(3 oraclc )+A £ i^r le i) 

> E (A-|^ fc) (3 oraclc )|)-|/3 jfc | 
>0, 

where we used the convexity of —£ c . In the proof of Theorem 3.1 we have 
shown that 



Pr(||^ c (/3 orade )|| 00 >A) 



< K 2 exp I — 9 ) + exp 



iVA 2 \ / NX 



+ if 2 s exp 



32(20 + l) 2 J y \ 35(20 + I)s8 3 
Nb 2 



+ 2s 2 



T3- 



2s 3 

I-.2 i 



b z N\ f Nb z \ f NB 



exp( - 8? J +exp r^y +exp r^T 



Therefore, the LLA-CMA algorithm finds the oracle estimator with proba- 
bility at least 1 - r 3 - Pr(||/3( 0) - /3*||oo > A). This proves part (1). 

If we further consider the event {minQ- fc ) g _4 |/?^ aclc | > aX}. Then /3^ 2 ) is 
the solution of the following penalized composite likelihood max | g{4(/3) — 
AS(jfc)e^l c l^j'fcl}' which implies that = /3^\ and hence the LLA loop 
will stop. From (A. 6) we have obtained a probability bound for the event of 
{min meA |^ adc | < «A} as follows: 

Prf min j^ le | < aX 



< 

= T 4 . 



( <i&\ 2 ( Nb 2 \ 2 / NB 2 
exp^-i?^ j +2s exp^--^- j +2s exp^--^ — 
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Then we have (3^ = /3 (1) = 3° raclc for m = 2, 3, . . . which means the LLA- 
CMA algorithm converges after two LLA iteration and finds the oracle es- 
timator with probability at least 1 — T3 — Pr(||/3^ ^ — /3*||oo > A) — 74. Note 
that T3 + T4 = T2- This proves part (2). □ 

Proof of Corollary 3.2. Part (1) follows directly from Theorem 3.3. 
We only prove part (2). With the chosen A lasso , Theorem 3.2 shows that with 
probability tending to one, 3l4 SS ° = 3.4, P l A c S ° = and Pr(||/3^ - P* A h < 
16A lasso Vs/6) -> 0. Note that 16A lasso ^/& < A scad and A -P A \\oc < \\P A - 
(3* A \\ 2 , we then conclude r = Pr(||3 lasso - P*\\oo < A scad ) -»• 0. □ 
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SUPPLEMENTARY MATERIAL 

Supplementary materials for "Non-concave penalized composite likeli- 
hood estimation of sparse Ising models" (DOI: 10. 1214/12- AOS1017SUPP; 
.pdf). In this supplementary file, we provide a complete theoretical analy- 
sis of the LASSO-penalized composite likelihood estimator for sparse Ising 
models. 
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